feat: Improve process termination logic in multiprocess manager #2371
base: master
I will add unit tests later, but I have no experience designing a process that hangs, so any help would be greatly appreciated.
I'll try to check this over the weekend. Sorry for the delay. For next time, your PRs always have priority, @abersheeran; please ping me if I take too long.
I'm not sure if we should use the same
Do you have any new ideas? We really need a configurable timeout here.
uvicorn/supervisors/multiprocess.py
Outdated
  def join(self, join_timeout: float | None = None) -> None:
      logger.info(f"Waiting for child process [{self.process.pid}]")
-     self.process.join()
+     self.process.join(join_timeout)
+     # Timeout, kill the process
+     while self.process.exitcode is None:
+         self.process.kill()
+         self.process.join(1)
Why do we have a `join(1)` here?
Wait for the kill command to take effect. If it does not take effect within 1 second, send the kill command again.
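The retry loop described in this reply can be sketched as a standalone function. This is only an illustration of the pattern from the diff, not the code merged into Uvicorn: `join_or_kill`, `retry_interval`, and `sleep_forever` are hypothetical names introduced here.

```python
import multiprocessing
import time


def join_or_kill(process, timeout, retry_interval=1.0):
    """Join `process`; if it outlives `timeout`, kill it until it exits.

    Hypothetical helper mirroring the loop in the diff above.
    """
    process.join(timeout)
    # exitcode stays None for as long as the child is still running.
    while process.exitcode is None:
        process.kill()  # SIGKILL on POSIX
        # Give the kill up to `retry_interval` seconds to take effect,
        # then loop and send it again if the child is still alive.
        process.join(retry_interval)
    return process.exitcode


def sleep_forever():
    # Hypothetical stand-in for a worker that never finishes on its own.
    while True:
        time.sleep(1)
```

Calling `join_or_kill(p, timeout=5)` on a started `multiprocessing.Process` returns the child's exit code once it is gone. The bounded `join(retry_interval)` is what keeps the loop from spinning while the operating system reaps the process.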
Co-authored-by: Marcelo Trylesinski <[email protected]>
    self.process.join(timeout)
    # Timeout, kill the process
    while self.process.exitcode is None:
        self.process.kill()
CI failed because this branch is not covered by tests, but I don't know how to design a process that is guaranteed to hang.
It's okay to add the pragma here.
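For anyone who wants to cover this branch later, one possible way to get a child that reliably does not exit on the stop signal is to have it ignore SIGTERM. The sketch below assumes a POSIX platform, where `Process.terminate()` sends SIGTERM; `stubborn_worker` and `start_stubborn_process` are hypothetical test helpers, not part of Uvicorn.

```python
import multiprocessing
import signal
import time


def stubborn_worker(ready):
    """Hypothetical test helper: a child that ignores SIGTERM (POSIX only)."""
    # After this call, Process.terminate() (SIGTERM) has no effect on the child.
    signal.signal(signal.SIGTERM, signal.SIG_IGN)
    ready.set()  # tell the parent the signal is now being ignored
    while True:
        time.sleep(0.1)


def start_stubborn_process(ctx=None):
    """Start the worker and wait until it is ignoring SIGTERM."""
    ctx = ctx or multiprocessing.get_context()
    ready = ctx.Event()
    process = ctx.Process(target=stubborn_worker, args=(ready,))
    process.start()
    ready.wait(timeout=5)  # avoid racing terminate() against the setup
    return process
```

A test could start such a process, call `terminate()`, check that `exitcode` is still `None` after a short `join`, and then confirm that only `kill()` ends it.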
Summary
About #2369
In our cluster, we accidentally discovered zombie processes and tracked down the cause. Uvicorn's new process manager sends the exit signal to all child processes and then joins them one by one. If an earlier child process takes a long time to exit, its join blocks and the later child processes cannot be joined.
I noticed that Uvicorn already has a built-in shutdown timeout; it would be nice to use it in the multiprocess manager as well.
We don't call terminate and join sequentially for each process because signaling all processes first shuts them down faster, as mentioned in PR #2010.
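The shutdown order described in this summary can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the manager's actual code: `stop_all` is a hypothetical function, `timeout` stands in for whatever shutdown timeout the manager is configured with, and sharing one deadline across all joins is a choice made for this sketch (the diff in this PR passes a timeout to each `join` instead).

```python
import multiprocessing
import time


def stop_all(processes, timeout):
    """Signal every child first, then wait for all of them within one deadline."""
    # Step 1: ask every child to exit before waiting on any of them.
    for process in processes:
        process.terminate()  # SIGTERM on POSIX

    # Step 2: join each child, but never wait past the shared deadline,
    # so one slow child cannot block the joins that follow it.
    deadline = time.monotonic() + timeout
    for process in processes:
        remaining = max(0.0, deadline - time.monotonic())
        process.join(remaining)

    # Step 3: kill anything that is still running after the deadline.
    for process in processes:
        while process.exitcode is None:
            process.kill()
            process.join(1)
```

With a shared deadline, the waiting phase takes at most roughly `timeout` seconds in total; with a per-process timeout it can take up to that long for each child that hangs.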
Checklist